Abstract

We consider the problem of reconstructing an unknown function $f$ on a domain $X$ from samples of
$f$ at $n$ randomly chosen points with respect to a given measure $\rho_X$. Given a sequence of linear spaces
$(V_m)_{m>0}$ with $\dim(V_m) = m \le n$, we study the least squares approximations from the spaces $V_m$. It
is well known that such approximations can be inaccurate when $m$ is too close to $n$, even when the
samples are noiseless. Our main result provides a criterion on $m$ that describes the needed amount
of regularization to ensure that the least squares method is stable and that its accuracy, measured in
$L^2(X,\rho_X)$, is comparable to the best approximation error of $f$ by elements from $V_m$. We illustrate
this criterion for various approximation schemes, such as trigonometric polynomials, with $\rho_X$ being the
uniform measure, and algebraic polynomials, with $\rho_X$ being either the uniform or Chebyshev measure.
For such examples we also prove similar stability results using deterministic samples that are equispaced
with respect to these measures.

1 Introduction and main results 

Let $X$ be a domain of $\mathbb{R}^d$ and $\rho_X$ be a probability measure on $X$. We consider the problem of estimating
an unknown function $f : X \to \mathbb{R}$ from samples $(y_i)_{i=1,\dots,n}$ which are either noiseless or noisy observations of
$f$ at the points $(x_i)_{i=1,\dots,n}$, where the $x_i$ are i.i.d. with respect to $\rho_X$. We measure the error between $f$ and
its estimator $\tilde f$ in the $L^2(X,\rho_X)$ norm

\[ \|v\| := \Big(\int_X |v(x)|^2 \, d\rho_X\Big)^{1/2}, \]

and we denote by $\langle\cdot,\cdot\rangle$ the associated inner product.

Given a fixed sequence of finite dimensional spaces $(V_m)_{m\ge 1}$ of $L^2(X,\rho_X)$ such that $\dim(V_m) = m$,
we would like to compute the best approximation of $f$ in $V_m$. This is given by the $L^2(X,\rho_X)$ orthogonal
projector onto $V_m$, which we denote by $P_m$:

\[ P_m f := \operatorname*{argmin}_{v \in V_m} \|f - v\|. \]

We let

\[ e_m(f) := \|f - P_m f\| \]

denote the best approximation error.
In general, we may not have access to either $\rho_X$ or any information about $f$ aside from the observations
at the points $(x_i)_{i=1,\dots,n}$. In this case we cannot explicitly compute $P_m f$. A natural approach in this setting
is to consider the solution of the least squares problem

\[ w = \operatorname*{argmin}_{v \in V_m} \sum_{i=1}^n |y_i - v(x_i)|^2. \]

Typically, we are interested in the case where $m \le n$, which is the regime where this problem may admit a
unique solution.

In the noiseless case $y_i = f(x_i)$, and hence $w$ may be viewed as the application of the least squares
projection operator onto $V_m$ to $f$, i.e., we can write

\[ w = P_m^n f := \operatorname*{argmin}_{v \in V_m} \|f - v\|_n, \]

where

\[ \|v\|_n := \Big(\frac{1}{n}\sum_{i=1}^n |v(x_i)|^2\Big)^{1/2} \]

is the $L^2$ norm with respect to the empirical measure and, analogously, $\langle\cdot,\cdot\rangle_n$ the associated empirical inner
product.

It is well known that least squares approximations may be inaccurate even when the measured samples
are noiseless. For example, if $V_m$ is the space $\mathbb{P}_{m-1}$ of algebraic polynomials of degree $m - 1$ over the
interval $[-1, 1]$ and if we choose $m = n$, this corresponds to Lagrange interpolation, which is known to be
highly unstable, failing to converge towards $f$ when given values at uniformly spaced samples, even when $f$
is infinitely smooth (the "Runge phenomenon"). Regularization by taking $m$ substantially smaller than $n$
may therefore be needed even in a noise-free context. The goal of this paper is to provide a mathematical
analysis of the exact amount of such regularization that is needed.

Stability of the least squares problem. The solution of the least squares problem can be computed by
solving an $m \times m$ system: specifically, if $(L_1, \dots, L_m)$ is an arbitrary basis for $V_m$, then we can write

\[ w = \sum_{j=1}^m u_j L_j, \]

where $\mathbf{u} = (u_j)_{j=1,\dots,m}$ is the solution of the $m \times m$ system

\[ \mathbf{G}\mathbf{u} = \mathbf{f}, \tag{1.1} \]

with $\mathbf{G} := (\langle L_j, L_k\rangle_n)_{j,k=1,\dots,m}$ and $\mathbf{f} := \big(\frac{1}{n}\sum_{i=1}^n y_i L_k(x_i)\big)_{k=1,\dots,m}$. In the noiseless case $y_i = f(x_i)$, so that
we can also write $\mathbf{f} := (\langle f, L_k\rangle_n)_{k=1,\dots,m}$. In the event that $\mathbf{G}$ is singular, we simply set $w = 0$.
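The system (1.1) can be assembled and solved directly for small $m$. The following is a minimal sketch in pure Python, not the authors' implementation: the helper names `solve_linear` and `least_squares`, the monomial basis, the sample size and the test function are all assumptions made for illustration.

```python
import random

def solve_linear(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-14:
            return None                                # G is numerically singular
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            t = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= t * M[c][k]
    u = [0.0] * m
    for r in range(m - 1, -1, -1):                     # back substitution
        s = sum(M[r][k] * u[k] for k in range(r + 1, m))
        u[r] = (M[r][m] - s) / M[r][r]
    return u

def least_squares(xs, ys, basis):
    """Coefficients u of w = sum_j u_j L_j solving G u = f as in (1.1)."""
    n, m = len(xs), len(basis)
    vals = [[L(x) for x in xs] for L in basis]         # vals[j][i] = L_j(x_i)
    G = [[sum(a * b for a, b in zip(vals[j], vals[k])) / n for k in range(m)]
         for j in range(m)]                            # empirical Gram matrix
    f = [sum(y * v for y, v in zip(ys, vals[k])) / n for k in range(m)]
    u = solve_linear(G, f)
    return u if u is not None else [0.0] * m           # w = 0 when G is singular

# Hypothetical usage: monomial basis of P_2 and noiseless samples of a quadratic.
random.seed(0)
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [random.uniform(-1.0, 1.0) for _ in range(20)]
ys = [1.0 - 2.0 * x + 3.0 * x * x for x in xs]
coeffs = least_squares(xs, ys, basis)
```

Since the sampled function lies in $V_m$ here, the computed coefficients reproduce it up to rounding. Forming the normal equations is adequate only for small, well conditioned examples such as this one.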

For the purposes of our analysis, suppose that the basis $(L_1, \dots, L_m)$ is orthonormal in the sense of
$L^2(X, \rho_X)$.$^1$ In this case we have

\[ E(\mathbf{G}) = (\langle L_j, L_k\rangle)_{j,k=1,\dots,m} = \mathbf{I}. \]

$^1$ While such a basis is generally not accessible when $\rho_X$ is unknown, we require it only for the analysis. The actual
computation of the estimator can be made using any known basis of $V_m$, since the solution $w$ is independent of the basis used
in computing it.
Our analysis requires an understanding of how the random matrix $\mathbf{G}$ deviates from its expectation $\mathbf{I}$ in
probability. Towards this end, we introduce the quantity

\[ K(m) := \sup_{x \in X} \sum_{j=1}^m |L_j(x)|^2. \]

Note that the function $\sum_{j=1}^m |L_j(x)|^2$ is invariant with respect to a rotation applied to $(L_1, \dots, L_m)$ and
therefore independent of the choice of the orthonormal basis: it only depends on the space $V_m$ and on the
measure $\rho_X$, and hence $K(m)$ also depends only on $V_m$ and $\rho_X$. Also note that

\[ K(m) \ge \sum_{j=1}^m \|L_j\|^2 = m. \]

We also will use the notation

\[ |||\mathbf{M}||| := \max_{v \neq 0} \frac{|\mathbf{M}v|}{|v|} \]

for the spectral norm of a matrix.

Our first result is a probabilistic estimate of the comparability of the norms $\|\cdot\|$ and $\|\cdot\|_n$ uniformly
over the space $V_m$. This is equivalent to the proximity of the matrices $\mathbf{G}$ and $\mathbf{I}$ in spectral norm, since we
have that for all $\delta \in [0, 1]$,

\[ |||\mathbf{G} - \mathbf{I}||| \le \delta \iff \big|\, \|v\|_n^2 - \|v\|^2 \big| \le \delta \|v\|^2, \quad v \in V_m. \]



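Theorem 1 For $0 < \delta < 1$, one has the estimate

\[ \Pr\{|||\mathbf{G} - \mathbf{I}||| > \delta\} = \Pr\{\exists v \in V_m : \big|\, \|v\|_n^2 - \|v\|^2 \big| > \delta \|v\|^2\} \le 2m \exp\Big\{-\frac{c_\delta\, n}{K(m)}\Big\}, \tag{1.2} \]

where $c_\delta := \delta + (1 - \delta)\log(1 - \delta) > 0$.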



The proof of Theorem 1 is a simple application of tail bounds for sums of random matrices obtained in
PP. A consequence of this result is that the norms $\|\cdot\|$ and $\|\cdot\|_n$ are comparable with high probability if
$K(m)$ is smaller than $n$ by a logarithmic factor: for example taking $\delta = \frac{1}{2}$, we find that for any $r > 0$,

\[ \Pr\Big\{|||\mathbf{G} - \mathbf{I}||| > \frac{1}{2}\Big\} = \Pr\Big\{\exists v \in V_m : \big|\, \|v\|_n^2 - \|v\|^2 \big| > \frac{1}{2}\|v\|^2\Big\} \le 2n^{-r}, \tag{1.3} \]

if $m$ is such that

\[ K(m) \le \kappa \frac{n}{\log n}, \quad \text{with } \kappa := \frac{c_{1/2}}{1 + r} = \frac{1 - \log 2}{2 + 2r}. \tag{1.4} \]
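For the reader's convenience, the step from (1.2) to (1.3) can be written out as follows. This is a short worked computation, taking Theorem 1 as stated and reading condition (1.4) as $K(m) \le \kappa n/\log n$ with $\kappa = (1 - \log 2)/(2 + 2r)$.

```latex
% With \delta = 1/2, the constant of Theorem 1 is
\[
c_{1/2} = \tfrac{1}{2} + \tfrac{1}{2}\log\tfrac{1}{2} = \frac{1 - \log 2}{2} = (1 + r)\,\kappa .
\]
% Condition (1.4) gives n / K(m) \ge \log(n) / \kappa, hence
\[
2m \exp\Big(-\frac{c_{1/2}\, n}{K(m)}\Big)
\le 2m \exp\big(-(1 + r)\log n\big)
= 2m\, n^{-(1 + r)}
\le 2 n^{-r},
\]
% where the last inequality uses m \le n.
```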

The above condition thus ensures that $\mathbf{G}$ is well conditioned with high probability. It can also be thought
of as ensuring that the least squares problem is stable with high probability. Indeed the right side of the
least squares system can be written as $\mathbf{f} = \mathbf{M}y$ with

\[ \mathbf{M} := \frac{1}{n}\big(L_j(x_i)\big)_{(j,i) \in \{1,\dots,m\} \times \{1,\dots,n\}}, \]

an $m \times n$ matrix. Observing that $\langle \mathbf{G}v, v\rangle = n|\mathbf{M}^T v|^2$, we find that

\[ |||\mathbf{M}||| = |||\mathbf{M}^T||| = \Big(\frac{1}{n}|||\mathbf{G}|||\Big)^{1/2}. \]
Therefore, if $|||\mathbf{G} - \mathbf{I}||| \le \frac{1}{2}$, then we have that for any data vector $y$ the solution $w = \sum_{j=1}^m u_j L_j$ satisfies

\[ \|w\| = |\mathbf{u}| \le |||\mathbf{G}^{-1}||| \cdot |||\mathbf{M}||| \cdot |y| \le 2\sqrt{\frac{3}{2n}}\,|y|, \]

where we have used $|||\mathbf{G}^{-1}||| \le 2$ and $|||\mathbf{G}||| \le \frac{3}{2}$. This gives the stability estimate

\[ \|w\| \le C\Big(\frac{1}{n}\sum_{i=1}^n |y_i|^2\Big)^{1/2}, \qquad C = \sqrt{6}. \]

In the noiseless case, this can be written as $\|P_m^n f\| \le C\|f\|_n$, i.e., the least squares projection is stable
between the norms $\|\cdot\|_n$ and $\|\cdot\|$. Note that since $K(m)$ not only depends on $V_m$ but also on the measure
$\rho_X$, the range of $m$ such that the condition (1.4) holds is strongly tied to the choice of the measure. This
issue is illustrated further in our numerical experiments.

Let us mention that similar probabilistic bounds have been previously obtained, see in particular §5.2 in
P]. These earlier results allow us to obtain the bound (1.3), however relying on the stronger condition

\[ K(m) \lesssim \Big(\frac{n}{\log(n)}\Big)^{1/2}. \]

The numerical results for polynomial least squares that we present in §3 hint that the weaker condition
$K(m) \lesssim \frac{n}{\log(n)}$ is sharp. The quantity $K(m)$ was also used in [3] in order to control the $L^\infty(X)$ norm and
the $L^2(\rho_X)$ norm.

Accuracy of least squares approximation. As an application, we can derive an estimate for the error
of least squares approximation in expectation. Here, we make the assumption that a uniform bound

\[ |f(x)| \le L, \tag{1.5} \]

holds for almost every $x$ with respect to $\rho_X$. For $m \le n$, we consider the truncated least squares estimator

\[ \tilde f = T_L(w), \]

where $T_L(t) = \operatorname{sign}(t)\min\{L, |t|\}$ truncates its argument to the interval $[-L, L]$. Our first result deals with the noiseless case.
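The two properties of the truncation that are used later in the proofs (it leaves values in $[-L, L]$ unchanged and never increases distances) can be checked on a few numbers. This is a small sketch that assumes $T_L$ is the clipping map $t \mapsto \operatorname{sign}(t)\min\{L, |t|\}$; the function name `truncate` and the sample values are illustrative only.

```python
def truncate(t, L):
    """T_L(t) = sign(t) * min(L, |t|), i.e. t clipped to the interval [-L, L]."""
    return max(-L, min(L, t))

L = 1.0
samples = [-2.5, -1.0, -0.3, 0.0, 0.7, 1.0, 3.0]
clipped = [truncate(t, L) for t in samples]

# Values already in [-L, L] are left unchanged, so T_L preserves f when |f| <= L.
preserved = all(truncate(t, L) == t for t in samples if abs(t) <= L)

# T_L never increases distances: |T_L(a) - T_L(b)| <= |a - b|.
contraction = all(abs(truncate(a, L) - truncate(b, L)) <= abs(a - b)
                  for a in samples for b in samples)
```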



Theorem 2 In the noiseless case, for any $r > 0$, if $m$ is such that the condition (1.4) holds, then

\[ E(\|f - \tilde f\|^2) \le (1 + \varepsilon(n))\, e_m(f)^2 + 8L^2 n^{-r}, \tag{1.6} \]

where $\varepsilon(n) := \frac{4\kappa}{\log(n)} \to 0$ as $n \to +\infty$, with $\kappa$ as in (1.4).



At this point a few remarks are due regarding the implications of this result in terms of the convergence 
rate of the estimate. 

Consider the following general setting of regression on a random design: we observe independent samples

\[ z_i = (x_i, y_i), \quad i = 1, \dots, n, \tag{1.7} \]

of a variable $z = (x, y)$ of law $\rho$ over $X \times Y$ and marginal law $\rho_X$ over $X$, and we want to estimate from
these samples the regression function defined as the conditional expectation

\[ f(x) := E(y \mid x). \tag{1.8} \]
We assume that the maximal variance

\[ \sigma^2 := \sup_{x \in X} E(|y - f(x)|^2 \mid x), \tag{1.9} \]

is bounded. We thus think of the $y_i$ as noisy observations of $f$ at $x_i$ with additive noise of variance at most
$\sigma^2$, namely

\[ y_i = f(x_i) + \eta_i, \tag{1.10} \]

where the $\eta_i$ are independent realizations of the variable $\eta := y - f(x)$.



Assuming that $f$ satisfies the uniform bound (1.5), one computes the truncated least squares estimator

\[ \tilde f = T_L(w), \]

now with $y_i$ in place of $f(x_i)$. A typical convergence bound for this estimator, see for example Theorem 11.3
in [6], is

\[ E(\|f - \tilde f\|^2) \le C\Big(e_m(f)^2 + \max\{L^2, \sigma^2\}\frac{m \log n}{n}\Big). \tag{1.11} \]

Convergence rates may be found after balancing the two terms, but they are limited by the optimal learning
rate $n^{-1}$, and this limitation persists even in the noiseless case $\sigma^2 = 0$ due to the presence of $L^2$ in the right
side of (1.11). In contrast, Theorem 2 yields fast convergence rates, provided that the approximation error
$e_m$ has fast decay and that the value of $m$ satisfying (1.4) can be chosen large enough.

One motivation for studying the noiseless case is the numerical treatment of parameter dependent PDEs 
of the general form 

where $x$ is a vector of parameters in some compact subset of $\mathbb{R}^d$. We can consider the solution map $x \mapsto f(x)$
either as giving the exact solution to the PDE for the given value of the parameter vector $x$ or as the exact
result of a numerical solver for this value of $x$. In the stochastic PDE context, $x$ is random and obeys
a certain law which may be known or unknown. From a random draw $(x_i)_{i=1,\dots,n}$, we obtain solutions
$f_i = f(x_i)$ which are noiseless observations of the solution map, and we are interested in reconstructing this
map. In instances such as elliptic problems with parameters in the diffusion coefficients, the solution map can
be well-approximated by polynomials in $x$ (see [1]). In this context, an initial study of the needed amount
of regularization was given in [7], however specifically targeted towards polynomial least squares.

For the noisy regression problem described above, our analysis can also be adapted in order to derive the 
following result. 



Theorem 3 For any $r > 0$, if $m$ is such that the condition (1.4) holds, then

\[ E(\|f - \tilde f\|^2) \le (1 + 2\varepsilon(n))\, e_m(f)^2 + 8L^2 n^{-r} + 8\sigma^2\frac{m}{n}, \tag{1.12} \]

with $\varepsilon(n)$ as in Theorem 2 and $\sigma^2$ the maximal variance given by (1.9).



In the noiseless case, the bound in Theorem 2 suggests that $m$ should be chosen as large as possible under
the constraint that (1.4) holds. In the noisy case, the value of $m$ minimizing the bound in Theorem 3 also
depends on the decay of $e_m$, which is generally unknown. In such a situation, a classical way of choosing the
value of $m$ is by a model selection procedure, such as adding a complexity penalty in the least squares or
using an independent validation sample. Such procedures can also be of interest in the noiseless case when
the measure $\rho_X$ is unknown, since the maximal value of $m$ such that (1.4) holds is then also unknown.

Let us give an example of how the results in Theorems 2 and 3 lead to specific rates of convergence in
terms of the number of samples: assume that $X = [-1, 1]$ is equipped with the uniform measure $\rho_X = \frac{dx}{2}$
and that $V_m = \mathbb{P}_{m-1}$ is the space of algebraic polynomials of degree $m - 1$. Then, if $f$ belongs to $C^r(X)$, the
space of $r$-times differentiable functions, it is well-known that $e_m(f)^2 \lesssim m^{-2r}$. On the one hand the results
in §3 show that condition (1.4) can be ensured with $m \sim (n/\log n)^{1/2}$. Therefore, in the noiseless case, we
obtain a bound proportional to $n^{-r}$ for the mean squared error, up to the logarithmic factor. In the noisy case,
after balancing the approximation and variance terms, we obtain a bound proportional to $\sigma^{2r/(r+1)} n^{-r/(r+1)}$.
On the other hand, these rates can be improved with $r$ replaced by $2r$ if we use the Chebyshev non-uniform
measure that concentrates near the end-points, since in that case the results in §3 show that condition (1.4)
can be ensured with $m \sim n/\log n$.
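The noiseless rates quoted above follow by inserting the admissible range of $m$ into the approximation term. The short computation below is a sketch that only tracks this term, using the values $K(m) = m^2$ (uniform measure) and $K(m) = 2m - 1$ (Chebyshev measure) obtained in §3.

```latex
% Uniform measure: K(m) = m^2, so (1.4) allows m \sim (n / \log n)^{1/2}, and
\[
e_m(f)^2 \lesssim m^{-2r} \sim \Big(\frac{\log n}{n}\Big)^{r} .
\]
% Chebyshev measure: K(m) = 2m - 1, so (1.4) allows m \sim n / \log n, and
\[
e_m(f)^2 \lesssim m^{-2r} \sim \Big(\frac{\log n}{n}\Big)^{2r} .
\]
```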

The rest of our paper is organized as follows: we give the proofs of the above results in §2 and we present 
in §3 examples of applications to classical approximation schemes such as piecewise constants, trigonometric 
polynomials, or algebraic polynomials. For such examples, we study the range of m such that (1.4) holds and 
show that this range is in accordance with stability results that can be proved for deterministic sampling. 
Numerical illustrations are given for algebraic polynomial approximation. 



2 Proofs 

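Proof of Theorem 1. The matrix $\mathbf{G}$ can be written as

\[ \mathbf{G} = \mathbf{X}_1 + \dots + \mathbf{X}_n, \]

where the $\mathbf{X}_i$ are i.i.d. copies of the random matrix

\[ \mathbf{X} = \frac{1}{n}\big(L_j(x)L_k(x)\big)_{j,k=1,\dots,m}, \]

where $x$ is distributed according to $\rho_X$. We use the following Chernoff bound from [5], originally obtained
by [1]: if $\mathbf{X}_1, \dots, \mathbf{X}_n$ are independent $m \times m$ random self-adjoint and positive matrices satisfying

\[ \lambda_{\max}(\mathbf{X}_i) = |||\mathbf{X}_i||| \le R \]

almost surely, then with

\[ \mu_{\min} := \lambda_{\min}\Big(\sum_{i=1}^n E(\mathbf{X}_i)\Big) \quad \text{and} \quad \mu_{\max} := \lambda_{\max}\Big(\sum_{i=1}^n E(\mathbf{X}_i)\Big), \]

one has

\[ \Pr\Big\{\lambda_{\min}\Big(\sum_{i=1}^n \mathbf{X}_i\Big) \le (1 - \delta)\mu_{\min}\Big\} \le m\Big(\frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}}\Big)^{\mu_{\min}/R}, \quad 0 \le \delta < 1, \]

and

\[ \Pr\Big\{\lambda_{\max}\Big(\sum_{i=1}^n \mathbf{X}_i\Big) \ge (1 + \delta)\mu_{\max}\Big\} \le m\Big(\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\Big)^{\mu_{\max}/R}, \quad \delta \ge 0. \]

In our present case, we have $\sum_{i=1}^n E(\mathbf{X}_i) = nE(\mathbf{X}) = \mathbf{I}$ so that $\mu_{\min} = \mu_{\max} = 1$. It is easily checked that
$\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}} \le \frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}}$ for $0 < \delta < 1$, and therefore

\[ \Pr\{|||\mathbf{G} - \mathbf{I}||| > \delta\} \le 2m\Big(\frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}}\Big)^{1/R} = 2m\exp\Big(-\frac{c_\delta}{R}\Big). \]

We next use the fact that a rank 1 symmetric matrix $ab^T = (b_j a_k)_{j,k=1,\dots,m}$ has its spectral norm equal to
the product of the Euclidean norms of the vectors $a$ and $b$, and therefore

\[ |||\mathbf{X}||| \le \frac{1}{n}\sum_{j=1}^m |L_j(x)|^2 \le \frac{K(m)}{n}, \]

almost surely. We may therefore take $R = \frac{K(m)}{n}$, which concludes the proof. □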

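Proof of Theorem 2. We denote by $d\rho_X^n := \otimes^n d\rho_X$ the probability measure of the draw. We also
denote by $\Omega$ the set of all possible draws, that we divide into the set $\Omega_+$ of all draws such that

\[ |||\mathbf{G} - \mathbf{I}||| \le \frac{1}{2} \]

and the complement set $\Omega_- := \Omega \setminus \Omega_+$. According to (1.3), we have

\[ \Pr\{\Omega_-\} = \int_{\Omega_-} d\rho_X^n \le 2n^{-r}, \tag{2.1} \]

under the condition (1.4). This leads to

\[ E(\|f - \tilde f\|^2) = \int_{\Omega} \|f - \tilde f\|^2 \, d\rho_X^n \le \int_{\Omega_+} \|f - P_m^n f\|^2 \, d\rho_X^n + 8L^2 n^{-r}, \]

where we have used $\|f - \tilde f\|^2 \le 4L^2$ on $\Omega_-$, as well as the fact that $T_L$ is a contraction that preserves $f$.

It remains to prove that the first term in the above right side is bounded by $(1 + \varepsilon(n))\, e_m(f)^2$. With
$g := f - P_m f$, we observe that

\[ f - P_m^n f = f - P_m f + P_m^n P_m f - P_m^n f = g - P_m^n g. \]

Since $g$ is orthogonal to $V_m$, we thus have

\[ \|f - P_m^n f\|^2 = \|g\|^2 + \|P_m^n g\|^2 = \|g\|^2 + \sum_{j=1}^m |a_j|^2, \]

where $\mathbf{a} = (a_j)_{j=1,\dots,m}$ is the solution of the system

\[ \mathbf{G}\mathbf{a} = \mathbf{b}, \]

with $\mathbf{b} := (\langle g, L_k\rangle_n)_{k=1,\dots,m}$. When the draw belongs to $\Omega_+$, we have $|||\mathbf{G}^{-1}||| \le 2$ and therefore

\[ \sum_{j=1}^m |a_j|^2 \le 4\sum_{k=1}^m |\langle g, L_k\rangle_n|^2. \]

It follows that

\[ \int_{\Omega_+} \|f - P_m^n f\|^2 \, d\rho_X^n \le \int_{\Omega_+} \Big(\|g\|^2 + 4\sum_{k=1}^m |\langle g, L_k\rangle_n|^2\Big) d\rho_X^n \le \|g\|^2 + 4\sum_{k=1}^m E(|\langle g, L_k\rangle_n|^2). \]

We estimate each of the $E(|\langle g, L_k\rangle_n|^2)$ as follows:

\[
\begin{aligned}
E(|\langle g, L_k\rangle_n|^2) &= \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n E\big(g(x_i)g(x_j)L_k(x_i)L_k(x_j)\big) \\
&= \frac{1}{n^2}\Big(n(n-1)|E(g(x)L_k(x))|^2 + nE(|g(x)L_k(x)|^2)\Big) \\
&= \Big(1 - \frac{1}{n}\Big)|\langle g, L_k\rangle|^2 + \frac{1}{n}\int_X |g(x)|^2|L_k(x)|^2 \, d\rho_X \\
&= \frac{1}{n}\int_X |g(x)|^2|L_k(x)|^2 \, d\rho_X,
\end{aligned}
\]

where we have used the fact that $g$ is orthogonal to $V_m$ and thus to $L_k$. Summing over $k$, we obtain

\[ \sum_{k=1}^m E(|\langle g, L_k\rangle_n|^2) \le \frac{K(m)}{n}\|g\|^2 \le \frac{\kappa}{\log(n)}\|g\|^2, \]

where we have used (1.4). We have thus proven that

\[ \int_{\Omega_+} \|f - P_m^n f\|^2 \, d\rho_X^n \le \Big(1 + \frac{4\kappa}{\log(n)}\Big)\|g\|^2 = (1 + \varepsilon(n))\, e_m(f)^2, \]

which concludes the proof. □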
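The identity $E(|\langle g, L_k\rangle_n|^2) = \frac{1}{n}\int_X |g|^2 |L_k|^2 \, d\rho_X$ used in the proof above can be illustrated by simulation. The sketch below is a Monte Carlo check under assumptions chosen only for illustration: $X = [-1, 1]$ with the uniform measure, $V_1$ the constants with $L_1 = 1$, and $g(x) = x$, which is orthogonal to $V_1$. The seed, sample size and number of trials are arbitrary.

```python
import random

def empirical_inner_product_sq(n, rng):
    """|<g, L_1>_n|^2 for g(x) = x and L_1 = 1, with x_i uniform on [-1, 1]."""
    mean = sum(rng.uniform(-1.0, 1.0) for _ in range(n)) / n
    return mean * mean

rng = random.Random(0)          # fixed seed so that the run is reproducible
n, trials = 10, 20000
estimate = sum(empirical_inner_product_sq(n, rng) for _ in range(trials)) / trials

# (1/n) * integral over [-1, 1] of |g|^2 |L_1|^2 dx/2 = (1/n) * (1/3)
exact = (1.0 / 3.0) / n
```

The Monte Carlo average `estimate` should agree with `exact` up to sampling error, which decreases like the inverse square root of `trials`.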

Proof of Theorem 3. We define the additive noise in the sample by writing

\[ y_i = f(x_i) + \eta_i, \]

and thus the $\eta_i$ are i.i.d. copies of the variable

\[ \eta = y - f(x). \]

Note that $\eta$ and $x$ are not assumed to be independent. However we have

\[ E(\eta \mid x) = 0, \]

which implies the decorrelation property

\[ E(\eta\, h(x)) = 0, \]

for any function $h$. As in the proof of Theorem 2 we split $\Omega$ into $\Omega_+$ and $\Omega_-$ and find that

\[ E(\|f - \tilde f\|^2) \le \int_{\Omega_+} \|f - w\|^2 \, d\rho^n + 8L^2 n^{-r}, \]

where $w$ now stands for the solution to the least squares problem with noisy data $(y_1, \dots, y_n)$. With the
same definition of $g = f - P_m f$, we can write

\[ f - w = g - P_m^n g - \bar w, \]

where $\bar w$ stands for the solution to the least squares problem for the noise data $(\eta_1, \dots, \eta_n)$. Therefore

\[ \|f - w\|^2 = \|g\|^2 + \|P_m^n g + \bar w\|^2 \le \|g\|^2 + 2\|P_m^n g\|^2 + 2\|\bar w\|^2 = \|g\|^2 + 2\sum_{j=1}^m |a_j|^2 + 2\sum_{j=1}^m |d_j|^2, \]

where $\mathbf{a} = (a_j)_{j=1,\dots,m}$ is as in the proof of Theorem 2 and $\mathbf{d} = (d_j)_{j=1,\dots,m}$ is the solution of the system

\[ \mathbf{G}\mathbf{d} = \mathbf{n}, \]

with $\mathbf{n} := \big(\frac{1}{n}\sum_{i=1}^n \eta_i L_k(x_i)\big)_{k=1,\dots,m} = (n_k)_{k=1,\dots,m}$. By the same arguments as in the proof of Theorem 2
we thus obtain

\[ E(\|f - \tilde f\|^2) \le (1 + 2\varepsilon(n))\, e_m(f)^2 + 8L^2 n^{-r} + 8\sum_{k=1}^m E(|n_k|^2). \]

We are left to show that $\sum_{k=1}^m E(|n_k|^2) \le \sigma^2\frac{m}{n}$. For this we simply write that

\[ E(|n_k|^2) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n E\big(\eta_i L_k(x_i)\,\eta_j L_k(x_j)\big). \]

For $i \neq j$, we have

\[ E\big(\eta_i L_k(x_i)\,\eta_j L_k(x_j)\big) = \big(E(\eta L_k(x))\big)^2 = 0. \]

For $i = j$, we have

\[
\begin{aligned}
E(|\eta_i L_k(x_i)|^2) = E(|\eta L_k(x)|^2) &= \int_X E(|\eta L_k(x)|^2 \mid x) \, d\rho_X \\
&= \int_X E(|\eta|^2 \mid x)\,|L_k(x)|^2 \, d\rho_X \\
&\le \sigma^2\int_X |L_k(x)|^2 \, d\rho_X = \sigma^2.
\end{aligned}
\]

It follows that $E(|n_k|^2) \le \frac{\sigma^2}{n}$, which concludes the proof. □

3 Examples and numerical illustrations 

We now give several examples of approximation schemes for which one can compute the quantity $K(m)$ and
therefore estimate the range of $m$ such that the condition (1.4) holds. For each of these examples, we also
exhibit a deterministic sampling $(x_1, \dots, x_n)$ for which the stability property

\[ |||\mathbf{G} - \mathbf{I}||| \le \frac{1}{2} \]

or equivalently

\[ \big|\, \|v\|_n^2 - \|v\|^2 \big| \le \frac{1}{2}\|v\|^2, \quad v \in V_m, \]

is ensured for the same range of $m$ (actually slightly better by a logarithmic factor). For the sake of simplicity,
we work in the one dimensional setting, with $X$ a bounded interval.

Piecewise constant functions. Here $X = [a, b]$ and $V_m$ is the space of piecewise constant functions over
a partition of $X$ into intervals $I_1, \dots, I_m$. In such a case, an orthonormal basis with respect to $L^2(X, \rho_X)$ is
given by the characteristic functions $L_k := (\rho_X(I_k))^{-1/2}\chi_{I_k}$, and therefore

\[ K(m) = \max_{k=1,\dots,m} (\rho_X(I_k))^{-1}. \]

Given a measure $\rho_X$, the partition that minimizes $K(m)$, and therefore allows us to fulfill (1.4) for the largest
range of $m$, is one that evenly distributes the measure $\rho_X$. With such partitions, $K(m)$ reaches its minimal
value

\[ K(m) = m, \]

and (1.4) can be achieved with $m \sim \frac{n}{\log n}$.

If we now choose $n = m$ deterministic points $x_1, \dots, x_m$ with $x_k \in I_k$, we clearly have

\[ \|v\|_n^2 = \|v\|^2, \quad v \in V_m. \]

Therefore the stability of the least squares problem can be ensured with $m$ up to the value $n$ using a deterministic
sample.
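The formula $K(m) = \max_k (\rho_X(I_k))^{-1}$ makes it easy to compare partitions numerically. The following sketch does so for two partitions into $m = 8$ cells; the function name and the unbalanced cell masses are hypothetical and serve only as an illustration.

```python
def K_piecewise_constant(masses):
    """K(m) = max_k 1 / rho_X(I_k) for a partition with the given cell masses."""
    assert abs(sum(masses) - 1.0) < 1e-12 and min(masses) > 0
    return max(1.0 / p for p in masses)

m = 8
equal_mass = [1.0 / m] * m                          # partition that equidistributes rho_X
unbalanced = [0.5] + [0.5 / (m - 1)] * (m - 1)      # hypothetical partition with one heavy cell

K_equal = K_piecewise_constant(equal_mass)          # minimal value, equal to m
K_unbalanced = K_piecewise_constant(unbalanced)     # larger, driven by the lightest cell
```

The lightest cell determines $K(m)$, which is why equidistributing the measure gives the smallest possible value.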



Trigonometric polynomials and uniform measure. Without loss of generality, we take $X = [-\pi, \pi]$,
and we consider for odd $m = 2p + 1$ the space $V_m$ of trigonometric polynomials of degree $p$, which is spanned
by the functions $L_k(x) = e^{ikx}$ for $k = -p, \dots, p$. Assuming that $\rho_X$ is the uniform measure, this is an
orthonormal basis with respect to $L^2(X, \rho_X)$. In this example, we again obtain the minimal value

\[ K(m) = m. \]

Therefore (1.4) can be achieved with $m \sim \frac{n}{\log n}$.

We now consider the deterministic uniform sampling $x_i := -\pi + \frac{2\pi i}{n}$ for $i = 1, \dots, n$. With such a sampling,
one has the identity

\[ \int_X v(x) \, d\rho_X = \frac{1}{2\pi}\int_{-\pi}^{\pi} v(x) \, dx = \frac{1}{n}\sum_{i=1}^n v(x_i), \]

for all trigonometric polynomials $v$ of degree $n - 1$ (this is easily seen by checking the identity on every basis
element). When $v \in V_m$ with $m = 2p + 1$, we know that $|v|^2$ is a trigonometric polynomial of degree $2p$. We
thus find that

\[ \|v\|_n^2 = \|v\|^2, \quad v \in V_m, \]

provided that $2p \le n - 1$, or equivalently $m \le n$. Therefore the stability of the least squares problem can
be ensured with $m$ up to the value $n$ using a deterministic sample.
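The quadrature identity for equispaced points can be checked on the basis functions $e^{ikx}$. The sketch below assumes the grid $x_i = -\pi + 2\pi i/n$ (any grid with spacing $2\pi/n$ behaves the same way) and an arbitrary choice $n = 16$; it also shows that the identity fails at frequency $k = n$, where aliasing occurs.

```python
import cmath
import math

def empirical_mean_exp(k, n):
    """(1/n) * sum_i exp(i k x_i) at the equispaced points x_i = -pi + 2 pi i / n."""
    pts = [-math.pi + 2.0 * math.pi * i / n for i in range(1, n + 1)]
    return sum(cmath.exp(1j * k * x) for x in pts) / n

n = 16
# Exact mean of exp(i k x) under the uniform measure on [-pi, pi]: 1 if k == 0, else 0.
errors = [abs(empirical_mean_exp(k, n) - (1.0 if k == 0 else 0.0))
          for k in range(-(n - 1), n)]
max_error = max(errors)                     # zero up to rounding for |k| <= n - 1

# At k = n every sample equals (-1)^n, so the empirical mean has modulus 1, not 0.
aliased = abs(empirical_mean_exp(n, n))
```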

Algebraic polynomials and uniform measure. Without loss of generality, we take $X = [-1, 1]$, and we
consider $V_m = \mathbb{P}_{m-1}$, the space of algebraic polynomials of degree $m - 1$. When $\rho_X$ is the uniform measure,
an orthonormal basis is given by defining $L_k$ as the Legendre polynomial of degree $k - 1$ with normalization

\[ \|L_k\|_{L^\infty([-1,1])} = |L_k(1)| = \sqrt{2k - 1}, \]

and thus

\[ K(m) = \sum_{k=1}^m (2k - 1) = m^2. \]

Therefore (1.4) can be achieved with $m \sim \sqrt{\frac{n}{\log n}}$, which is a lower range compared to the previous examples.
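The value $K(m) = m^2$ for the Legendre basis can be confirmed numerically. The sketch below evaluates the orthonormal Legendre polynomials through the classical three-term recurrence and takes the maximum of $\sum_j |L_j(x)|^2$ over a grid; the function names, the choice $m = 6$ and the grid size are assumptions made for illustration.

```python
import math

def legendre_orthonormal(m, x):
    """Values L_1(x), ..., L_m(x): Legendre polynomials of degree 0..m-1,
    orthonormal for the uniform probability measure dx/2 on [-1, 1]."""
    P = [1.0, x]                                   # classical P_0 and P_1
    for k in range(1, m - 1):                      # (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
        P.append(((2 * k + 1) * x * P[k] - k * P[k - 1]) / (k + 1))
    return [math.sqrt(2 * k + 1) * P[k] for k in range(m)]

def christoffel_sum(m, x):
    """sum_{j=1}^m |L_j(x)|^2, whose supremum over x is K(m)."""
    return sum(v * v for v in legendre_orthonormal(m, x))

m = 6
grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
K_grid = max(christoffel_sum(m, x) for x in grid)  # grid estimate of K(m)
K_endpoint = christoffel_sum(m, 1.0)               # value at the endpoint x = 1
```

The maximum is attained at the endpoints $x = \pm 1$, where each classical Legendre polynomial has modulus 1, so both quantities equal $m^2$ up to rounding.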



We now consider the deterministic sampling obtained by partitioning $X$ into $n$ intervals $(I_1, \dots, I_n)$ of equal
length $\frac{2}{n}$, and picking one point $x_i$ in each $I_i$. For any $v \in V_m$, we may write



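\[
\begin{aligned}
\Big|\int_{I_i} |v(x)|^2 \, d\rho_X - \frac{1}{n}|v(x_i)|^2\Big|
&= \Big|\frac{1}{2}\int_{I_i} |v(x)|^2 \, dx - \frac{1}{n}|v(x_i)|^2\Big| \\
&= \frac{1}{2}\Big|\int_{I_i} \big(|v(x)|^2 - |v(x_i)|^2\big) \, dx\Big| \\
&\le \frac{1}{2}\int_{I_i} \big|\, |v(x)|^2 - |v(x_i)|^2 \big| \, dx \\
&\le \frac{1}{n}\int_{I_i} |(v^2)'(x)| \, dx \\
&= \frac{2}{n}\int_{I_i} |v'(x)v(x)| \, d\rho_X.
\end{aligned}
\]

Summing over $i$, it follows that

\[ \big|\, \|v\|_n^2 - \|v\|^2 \big| \le \frac{2}{n}\int_X |v'(x)v(x)| \, d\rho_X \le \frac{2}{n}\|v'\|\,\|v\| \le \frac{2(m - 1)^2}{n}\|v\|^2, \]

where we have used the Cauchy-Schwarz and Markov inequalities. Therefore the stability of the least squares
problem can be ensured with $m$ up to the value $\frac{\sqrt{n}}{2} + 1$ using a deterministic sample.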



Algebraic polynomials and Chebyshev measure. Consider again algebraic polynomials of degree
$m - 1$ on $X = [-1, 1]$, now equipped with the measure

\[ d\rho_X = \frac{dx}{\pi\sqrt{1 - x^2}}. \]

Then an orthonormal basis is given by defining $L_k$ as the Chebyshev polynomial of degree $k - 1$, with $L_1 = 1$
and

\[ L_k(x) = \sqrt{2}\cos((k - 1)\arccos x), \]

for $k > 1$, and thus

\[ K(m) = 2m - 1. \]

Therefore (1.4) can be achieved with $m \sim \frac{n}{\log n}$, which expresses the fact that least squares approximations
are stable for higher polynomial degrees when working with the Chebyshev measure rather than with the
uniform measure.
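To make the comparison between the two measures concrete, one can compute the largest $m$ allowed by condition (1.4) in each case. The sketch below takes $\kappa = (1 - \log 2)/(2 + 2r)$ as stated in (1.4), and uses $K(m) = m^2$ for the uniform measure and $K(m) = 2m - 1$ for the Chebyshev measure; the function names and the values $n = 10^5$, $r = 1$ are illustrative assumptions.

```python
import math

def K_chebyshev(m, x):
    """sum_k |L_k(x)|^2 with L_1 = 1 and L_k = sqrt(2) cos((k-1) arccos x)."""
    t = math.acos(x)
    return 1.0 + sum(2.0 * math.cos(k * t) ** 2 for k in range(1, m))

def kappa(r):
    """Constant kappa = (1 - log 2) / (2 + 2 r) appearing in (1.4)."""
    return (1.0 - math.log(2.0)) / (2.0 + 2.0 * r)

def largest_m(K, n, r):
    """Largest m <= n with K(m) <= kappa * n / log n (0 if there is none)."""
    bound = kappa(r) * n / math.log(n)
    m = 0
    while m + 1 <= n and K(m + 1) <= bound:
        m += 1
    return m

n, r = 100000, 1.0
m_uniform = largest_m(lambda m: m * m, n, r)         # Legendre basis: K(m) = m^2
m_chebyshev = largest_m(lambda m: 2 * m - 1, n, r)   # Chebyshev basis: K(m) = 2m - 1
K_at_endpoint = K_chebyshev(5, 1.0)                  # attains 2m - 1 = 9 at x = 1
```

For these values the admissible range under the Chebyshev measure is more than ten times larger than under the uniform measure, in line with the scalings $n/\log n$ and $\sqrt{n/\log n}$.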

We now consider the deterministic sampling obtained by partitioning $X$ into $n$ intervals $(I_1, \dots, I_n)$ of equal
Chebyshev measure $\rho_X(I_i) = \frac{1}{n}$, and picking one point $x_i$ in each $I_i$. For any $v \in V_m$, we may write

\[
\begin{aligned}
\Big|\int_{I_i} |v(x)|^2 \, d\rho_X - \frac{1}{n}|v(x_i)|^2\Big|
&= \Big|\int_{I_i} \big(|v(x)|^2 - |v(x_i)|^2\big) \, d\rho_X\Big| \\
&\le \int_{I_i} \big|\, |v(x)|^2 - |v(x_i)|^2 \big| \, d\rho_X \\
&\le \rho_X(I_i)\int_{I_i} |(v^2)'(x)| \, dx \\
&= \frac{2}{n}\int_{I_i} |v'(x)v(x)| \, dx.
\end{aligned}
\]

Summing over $i$, it follows that

\[ \big|\, \|v\|_n^2 - \|v\|^2 \big| \le \frac{2}{n}\int_X |v'(x)v(x)| \, dx \le \frac{2}{n}\Big(\pi\int_X |v'(x)|^2\sqrt{1 - x^2} \, dx\Big)^{1/2}\|v\|. \]

Using the change of variable $x = \cos t$, it is easily seen that the inverse estimate

\[ \int_X |v'(x)|^2\sqrt{1 - x^2} \, dx \le (m - 1)^2\int_X |v(x)|^2\frac{dx}{\sqrt{1 - x^2}} \]

holds for any $v \in V_m$. Therefore

\[ \big|\, \|v\|_n^2 - \|v\|^2 \big| \le \frac{2}{n}\int_X |v'(x)v(x)| \, dx \le \frac{2\pi(m - 1)}{n}\|v\|^2, \]

which shows that the stability of the least squares problem can be ensured with $m$ up to the value $\frac{n}{4\pi} + 1$
using a deterministic sample.

Let us observe that in several practical scenarios, the measure $\rho_X$ of the observations may be unknown
to us, therefore raising the question of the behavior of $K(m)$ for an arbitrary measure.

It is not too difficult to check that when the space $V_m$ is not the trivial space of constant functions (which
is the case as soon as $m \ge 2$) the quantity $K(m)$ may become arbitrarily large for certain measures $\rho_X$. We
leave the proof of this general fact as an exercise for the reader, and rather provide a simple illustration:
consider the space $V_2$ of polynomials of degree 1 on $[-1, 1]$ and the measure $\rho_X = \frac{1}{2\varepsilon}\chi_{[-\varepsilon,\varepsilon]}(x)\,dx$ where
$\varepsilon > 0$ is small. Then an orthonormal basis is provided by the functions $L_0(x) = 1$ and $L_1(x) = \frac{\sqrt{3}}{\varepsilon}x$, so
that $K(m) \sim \varepsilon^{-2}$. An interesting problem is to understand if for certain families of spaces $(V_m)$, the quantity
$K(m)$ can be controlled under fairly general assumptions on the measure $\rho_X$. One typical such assumption
is the strong density assumption, which states that

\[ \rho_X(E) \sim |E|, \quad E \text{ measurable}, \tag{3.1} \]

where $|\cdot|$ is the Lebesgue measure. In the case of piecewise constant functions on uniform partitions, or for
more general spline functions on uniform grids, it is not difficult to check that this assumption implies the
behavior $K(m) \sim m$.
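The blow-up of $K(m)$ in the illustration with the concentrated measure $\rho_X = \frac{1}{2\varepsilon}\chi_{[-\varepsilon,\varepsilon]}(x)\,dx$ can be made explicit. The following worked computation is a sketch for $m = 2$, with the supremum in the definition of $K(m)$ taken over the whole interval $X = [-1, 1]$.

```latex
% Second moment of the concentrated measure:
\[
\int_X x^2 \, d\rho_X = \frac{1}{2\varepsilon}\int_{-\varepsilon}^{\varepsilon} x^2 \, dx = \frac{\varepsilon^2}{3},
\]
% so the normalized linear function is \sqrt{3}\, x / \varepsilon, and
\[
K(2) = \sup_{x \in [-1,1]} \Big(1 + \frac{3x^2}{\varepsilon^2}\Big) = 1 + \frac{3}{\varepsilon^2} \sim \varepsilon^{-2} .
\]
```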
Numerical illustration. We conclude with a brief numerical illustration of our theoretical results for
the setting of algebraic polynomials. Specifically, we consider the smooth function $f_1(x) = 1/(1 + 25x^2)$
originally considered by Runge to illustrate the instability of polynomial interpolation at equispaced points,
and the non-smooth function $f_2(x) = |x|$, both restricted to the interval $[-1, 1]$.

For both functions, we take $n$ i.i.d. samples $x_1, \dots, x_n$ with respect to a measure $\rho_X$ on $X = [-1, 1]$ and
compute the noise-free observations $y_i = f(x_i)$. We consider either the uniform measure $\rho_X := \frac{dx}{2}$ or the
Chebyshev measure $\rho_X := \frac{dx}{\pi\sqrt{1 - x^2}}$. In both cases, we compute the least squares approximating polynomial
of degree $m$ using these points for a range of different values of $m \le n$. We then numerically compute the
error in the $L^2(X, \rho_X)$ norm, with $\rho_X$ the corresponding measure from which the samples have been drawn,
using the adaptive Simpson's quadrature rule [5] implemented in Matlab.

Figure 3.1 shows the results of this simulation using $n_1 = 200$ samples for estimating $f_1$ and $n_2 = 1000$
samples for estimating $f_2$. We observe that, in all cases, as $m$ approaches $n$ the solutions become highly
inaccurate due to the inherent instability of the problem. However, we can set $m$ to be much larger before
instability starts to develop when the points are drawn with respect to the Chebyshev measure, as is expected.

Next we consider the effect of $n$ on the best choice of $m$. Specifically, for any given sample of points
we can compute the value $m(n)$ that corresponds to the polynomial degree for which we obtain the best
approximation to $f_1$ or $f_2$ and examine how this behaves as a function of $n$. This is shown in Figure 3.2, which
displays as a function of $n$ the average value of $m(n)$ over 50 realizations of the sample, for both measures
and both functions $f_1$ and $f_2$ (the averaging has the effect of reducing oscillation in the curve $n \mapsto m(n)$,
making it more readable). We vary the sample size from $n = 1$ to 1000 for $f_2$, but only from $n = 1$ to 200
for the smooth function $f_1$, since in that case the $L^2(X, \rho_X)$ error drops below machine precision for larger
values of $n$ with $m$ in the regime where the least squares problem is stable and therefore the minimal value
$m(n)$ cannot be precisely located.

We observe that, in accordance with our theoretical results, $m(n)$ behaves like $\sqrt{n}$ when the points are
drawn with respect to the uniform measure, while it behaves almost linearly in $n$ when the points are drawn
with respect to the Chebyshev measure.
Figure 3.2: Optimal values $m(n)$ as $n$ varies (a) for $f_1$ (comparison with $0.7n$ and $2.5\sqrt{n}$) and (b) for $f_2$
(comparison with $0.1n$ and $0.4\sqrt{n}$).